In this module, we’ll cover essential steps for preprocessing and quality control of 16S rRNA sequencing data. Effective data preprocessing ensures that our analyses are based on high-quality data, minimizing errors and maximizing accuracy. This module will guide you through importing raw sequencing data, performing quality filtering, and removing potential errors such as chimeric sequences.
We’ll explain the FASTQ file format and use DADA2 for quality assessment and feature extraction. This hands-on practice imports and processes real data from the Nevada Wolfpack Study, which we will use in the next module for downstream analyses.
Our starting point is demultiplexed samples sorted into individual files. Demultiplexing is a crucial step in processing high-throughput sequencing data when multiple samples are sequenced together in a single run (a process called multiplexing). Multiplexing allows researchers to save time and resources by sequencing multiple samples at once, but it introduces the need to sort, or “demultiplex,” the data afterward.
\(~\)
In multiplexing, unique DNA “barcode” sequences are added to each sample before sequencing. During demultiplexing, these barcodes are used to identify which sequences belong to each original sample. The demultiplexing step occurs immediately after the sequencing machine outputs the raw data.
\(~\)
Demultiplexing software scans the data, recognizes the barcodes, and then separates sequences into individual files for each sample. Once demultiplexed, the data can be analyzed separately for each sample, moving on to further steps such as quality control, trimming, and chimera removal. Demultiplexing is typically performed by the sequencing platform’s software and files of individual samples are given to the researcher.
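The barcode-matching idea can be sketched in a few lines of R. This is purely illustrative (real demultiplexers handle longer barcodes, sequencing errors, and dual indices); the barcodes and reads here are made up:

```r
# Hypothetical 4-base sample barcodes and raw reads (barcode + insert)
barcodes <- c(sampleA = "ACGT", sampleB = "TTAG")
reads <- c("ACGTGGCCTTAA", "TTAGCCGGAATT", "ACGTAATTCCGG")

# Assign each read to the sample whose barcode matches its first 4 bases
assignments <- names(barcodes)[match(substr(reads, 1, 4), barcodes)]
assignments
## [1] "sampleA" "sampleB" "sampleA"
```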
Working with microbiome data can be challenging. Traditional analysis methods cluster sequences based on similarity into operational taxonomic units (OTUs), often missing the true biological diversity in the sample and sometimes lumping together different species. This can reduce the accuracy of the analysis, especially when studying complex microbial communities. Also, errors are often introduced during the sequencing process, such as substitutions (wrong bases) and chimeras (sequences formed when two different DNA strands merge). These errors must be properly handled to avoid incorrect species identification.
We will use DADA2 to load, process, and quality control our sequencing files. DADA2 is a widely used software package in bioinformatics for analyzing microbial communities from DNA sequencing data and handling errors. It works particularly well in processing 16S rRNA data.
DADA2 takes a new approach by attempting to resolve individual sequences down to the level of single-nucleotide differences while correcting for sequencing errors [1]. The resulting sequences are called Amplicon Sequence Variants (ASVs): highly accurate sequences that represent the actual biological diversity in the sample without clustering.
\(~\)
The DADA2 workflow includes several key steps:
1. Inspecting Read Quality Profiles: Before processing, DADA2 allows us to visualize the quality scores across all sequences. Quality scores provide an estimate of the accuracy of each base call in a sequence. By inspecting these scores, we can identify where the quality drops, which helps determine appropriate trimming parameters.
2. Filtering and Trimming: Low-quality sequences and ambiguous bases can introduce errors in downstream analysis. In this step, we set quality thresholds to remove reads that don’t meet quality standards. Trimming removes low-quality ends of reads based on quality scores, helping to retain only high-quality sequences for further analysis.
3. Learning Error Rates: DADA2 builds a model of the specific error rates associated with the sequencing run. Since error patterns vary between sequencing technologies and even between runs, DADA2 analyzes a subset of data to estimate these unique error rates. This error model helps DADA2 accurately correct errors in the sequencing reads.
4. Inferring Sample Composition (ASV Inference): Using the error model, DADA2 corrects sequencing errors to identify unique, true biological sequences known as Amplicon Sequence Variants (ASVs). ASVs provide finer resolution than Operational Taxonomic Units (OTUs) and often represent exact biological sequences rather than clustered approximations.
5. Merging Paired Reads: Since we are using paired-end sequencing, this step merges the forward and reverse reads that cover overlapping regions of the same DNA fragment. The merged reads typically provide higher accuracy by combining information from both ends, but reads that fail to overlap are discarded.
6. Constructing a Sequence Table: After ASVs are identified, DADA2 constructs a sequence table, which is a matrix of samples and their ASV counts. This table serves as the primary data structure for downstream analysis, showing how many times each ASV appears in each sample.
7. Removing Chimeras: Chimeric sequences are artifacts from PCR amplification, formed by combining segments from different DNA templates. DADA2 identifies these artificial sequences by comparing ASVs and removes them to reduce false positives in microbial identification.
8. Tracking Reads Through the Pipeline: Throughout the pipeline, DADA2 tracks the number of reads retained at each step. This read-tracking helps to monitor data quality and determine where reads may have been lost, providing a quality control check on each processing stage.
9. Assigning Taxonomy: After obtaining high-quality ASVs, DADA2 compares them to a reference database (e.g., SILVA, Greengenes) to assign taxonomic labels. This step allows us to identify the microbial taxa present in each sample down to species, genus, or other taxonomic levels, depending on the reference database used.
This process gives us highly accurate and reproducible results. We will start by installing and loading DADA2 and necessary packages.
\(~\)
#Install the DADA2 and other packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("dada2", quietly = TRUE)) {
BiocManager::install("dada2")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if (!requireNamespace("patchwork", quietly = TRUE)) {
install.packages("patchwork")
}
#Load the packages
library(dada2)
library(ggplot2)
library(patchwork)
Our sequencing files are stored in FASTQ format, a common file type for sequencing data. Each entry (or sequence) in a FASTQ file contains four lines: the sequence identifier, the DNA sequence, a separator, and a quality score line. This format is compact and efficient, making it essential for storing large-scale sequencing data.
\(~\)
\(~\)
Each sequence entry in a FASTQ file follows the same four-line format:
1. Sequence Identifier: Starts with an “@” symbol, followed by a unique identifier for the sequence. It may also contain additional details, such as the instrument used and the read number.
2. DNA Sequence: The actual sequence of DNA bases (A, T, C, G) observed in the sample. The length of this line is equal to the number of bases sequenced.
3. Separator Line: A “+” symbol acts as a separator between the sequence and its quality scores.
4. Quality Scores: A string of characters representing the quality of each base in the sequence. Each character encodes a Phred quality score, where higher scores indicate more confidence in the accuracy of the base call. The quality string is the same length as the sequence line.
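To make the four-line format concrete, here is a small R sketch with a made-up FASTQ entry, decoding its quality string into Phred scores (standard Phred+33 encoding: score = ASCII code − 33):

```r
# A hypothetical FASTQ entry, one string per line
entry <- c("@SEQ_1 instrument=M0XXXX read=1",  # 1. sequence identifier
           "GATTTGGGGTTCAAAGCAGT",             # 2. DNA sequence (20 bases)
           "+",                                # 3. separator
           "!''*((((***+))%%%++(")             # 4. quality string (20 characters)

# Decode Phred+33 quality characters: score = ASCII code - 33
phred <- utf8ToInt(entry[4]) - 33
phred
## [1]  0  6  6  9  7  7  7  7  9  9  9 10  8  8  4  4  4 10 10  7
```

Note that the quality string has exactly one character per base, which is why lines 2 and 4 are always the same length.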
\(~\)
Now we will download our data and import it with DADA2.
\(~\)
# Download data
system("wget -r -np -R 'index.html*' --no-check-certificate --cut-dirs=2 -nH https://biox.unr.edu/ftp/16S_Training_Module/")
# Check files
path <- "fastqs"
list.files(path)
## [1] "WP-11O29_S66_L001_R1_001.fastq" "WP-11O29_S66_L001_R2_001.fastq"
## [3] "WP-1641C_S68_L001_R1_001.fastq" "WP-1641C_S68_L001_R2_001.fastq"
## [5] "WP-1J0A0_S50_L001_R1_001.fastq" "WP-1J0A0_S50_L001_R2_001.fastq"
## [7] "WP-3GH4E_S51_L001_R1_001.fastq" "WP-3GH4E_S51_L001_R2_001.fastq"
## [9] "WP-4SP0U_S57_L001_R1_001.fastq" "WP-4SP0U_S57_L001_R2_001.fastq"
## [11] "WP-6RCV3_S39_L001_R1_001.fastq" "WP-6RCV3_S39_L001_R2_001.fastq"
## [13] "WP-7FAEC_S47_L001_R1_001.fastq" "WP-7FAEC_S47_L001_R2_001.fastq"
## [15] "WP-8SXNH_S53_L001_R1_001.fastq" "WP-8SXNH_S53_L001_R2_001.fastq"
## [17] "WP-8V6E3_S72_L001_R1_001.fastq" "WP-8V6E3_S72_L001_R2_001.fastq"
## [19] "WP-95020_S58_L001_R1_001.fastq" "WP-95020_S58_L001_R2_001.fastq"
## [21] "WP-9ZJJI_S55_L001_R1_001.fastq" "WP-9ZJJI_S55_L001_R2_001.fastq"
## [23] "WP-ATJLX_S75_L001_R1_001.fastq" "WP-ATJLX_S75_L001_R2_001.fastq"
## [25] "WP-DFG11_S74_L001_R1_001.fastq" "WP-DFG11_S74_L001_R2_001.fastq"
## [27] "WP-EC5AP_S42_L001_R1_001.fastq" "WP-EC5AP_S42_L001_R2_001.fastq"
## [29] "WP-G375R_S73_L001_R1_001.fastq" "WP-G375R_S73_L001_R2_001.fastq"
## [31] "WP-G9EBL_S40_L001_R1_001.fastq" "WP-G9EBL_S40_L001_R2_001.fastq"
## [33] "WP-J06MK_S31_L001_R1_001.fastq" "WP-J06MK_S31_L001_R2_001.fastq"
## [35] "WP-J0B3X_S33_L001_R1_001.fastq" "WP-J0B3X_S33_L001_R2_001.fastq"
## [37] "WP-KM8BK_S62_L001_R1_001.fastq" "WP-KM8BK_S62_L001_R2_001.fastq"
## [39] "WP-KXGT6_S37_L001_R1_001.fastq" "WP-KXGT6_S37_L001_R2_001.fastq"
## [41] "WP-MRCZG_S36_L001_R1_001.fastq" "WP-MRCZG_S36_L001_R2_001.fastq"
## [43] "WP-QIGEC_S59_L001_R1_001.fastq" "WP-QIGEC_S59_L001_R2_001.fastq"
## [45] "WP-RAJES_S35_L001_R1_001.fastq" "WP-RAJES_S35_L001_R2_001.fastq"
## [47] "WP-RPV50_S49_L001_R1_001.fastq" "WP-RPV50_S49_L001_R2_001.fastq"
## [49] "WP-SUFNU_S34_L001_R1_001.fastq" "WP-SUFNU_S34_L001_R2_001.fastq"
## [51] "WP-TDY7G_S32_L001_R1_001.fastq" "WP-TDY7G_S32_L001_R2_001.fastq"
## [53] "WP-U4XI6_S44_L001_R1_001.fastq" "WP-U4XI6_S44_L001_R2_001.fastq"
## [55] "WP-UIXW3_S45_L001_R1_001.fastq" "WP-UIXW3_S45_L001_R2_001.fastq"
## [57] "WP-X576L_S56_L001_R1_001.fastq" "WP-X576L_S56_L001_R2_001.fastq"
## [59] "WP-X803O_S69_L001_R1_001.fastq" "WP-X803O_S69_L001_R2_001.fastq"
## [61] "WP-XDI0I_S65_L001_R1_001.fastq" "WP-XDI0I_S65_L001_R2_001.fastq"
## [63] "WP-YF6XT_S61_L001_R1_001.fastq" "WP-YF6XT_S61_L001_R2_001.fastq"
## [65] "WP-YRYYF_S67_L001_R1_001.fastq" "WP-YRYYF_S67_L001_R2_001.fastq"
## [67] "WP-YY8WG_S38_L001_R1_001.fastq" "WP-YY8WG_S38_L001_R2_001.fastq"
## [69] "WP-Z01LM_S43_L001_R1_001.fastq" "WP-Z01LM_S43_L001_R2_001.fastq"
# Forward and reverse fastq filenames have format: SAMPLENAME_L001_R1_001.fastq and SAMPLENAME_L001_R2_001.fastq
fnFs <- sort(list.files(path, pattern="_L001_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_L001_R2_001.fastq", full.names = TRUE))
# Extract sample names
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
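As a quick sanity check on the name extraction, strsplit() splits a filename on underscores and `[`, 1 keeps the first piece:

```r
# The sample name is the portion of the filename before the first underscore
strsplit("WP-11O29_S66_L001_R1_001.fastq", "_")[[1]][1]
## [1] "WP-11O29"
```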
Check to make sure our sample names are correct.
\(~\)
# Display sample names and number of samples
cat("Sample names:\n")
## Sample names:
cat(sample.names, sep = "\n")
## WP-11O29
## WP-1641C
## WP-1J0A0
## WP-3GH4E
## WP-4SP0U
## WP-6RCV3
## WP-7FAEC
## WP-8SXNH
## WP-8V6E3
## WP-95020
## WP-9ZJJI
## WP-ATJLX
## WP-DFG11
## WP-EC5AP
## WP-G375R
## WP-G9EBL
## WP-J06MK
## WP-J0B3X
## WP-KM8BK
## WP-KXGT6
## WP-MRCZG
## WP-QIGEC
## WP-RAJES
## WP-RPV50
## WP-SUFNU
## WP-TDY7G
## WP-U4XI6
## WP-UIXW3
## WP-X576L
## WP-X803O
## WP-XDI0I
## WP-YF6XT
## WP-YRYYF
## WP-YY8WG
## WP-Z01LM
cat("There are", length(sample.names), "samples")
## There are 35 samples
An .rds file saves a single R object that can be quickly loaded back into R using readRDS(). While .rds files are optimized for R and can’t be viewed directly outside of R, they are efficient for storing individual R objects and preserve their exact structure. This format is especially useful when you need to save work and reload it between sessions.
\(~\)
# Save our data
saveRDS(sample.names, file = 'sampleNames.rds')
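A minimal round-trip, using a toy object and a temporary file, shows what saveRDS()/readRDS() guarantee:

```r
# readRDS() restores the saved object exactly, including names and types
obj <- c(a = 1L, b = 2L)
tmp <- tempfile(fileext = ".rds")
saveRDS(obj, tmp)
identical(readRDS(tmp), obj)
## [1] TRUE
```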
We will first look at the quality of the individual bases in the reads by pulling the Phred score for each read in a sample. The heatmap produced shows the frequency of each quality score at each base position over all the reads for a given sample. The green line represents the mean quality score, with orange lines representing the median (solid) and the quartiles (dashed).
\(~\)
Note: Some of these steps take several minutes to run.
\(~\)
# View quality profiles of forward reads for the first 2 samples
plotQualityProfile(fnFs[1:2])
This step will take several minutes to run.
\(~\)
# We can create a plot for all of the samples and save it
p1 = plotQualityProfile(fnFs) +
ggtitle('Forward Reads')
p2 = plotQualityProfile(fnRs)+
ggtitle('Reverse Reads')
# Create the output folder and save the plot
dir.create('viz_data', showWarnings = FALSE)
ggsave('viz_data/plotQualityProfile.png', plot = p1 + p2, width = 18, height = 12, units = 'in', dpi = 600)
We can manually view the quality profiles for all of the samples (plotQualityProfile.png) in the viz_data folder. Quality scores typically decrease over sequencing cycles (positions along the read) as the sequencing chemistry degrades and clusters lose synchrony in later cycles.
\(~\)
# View quality profiles of reverse reads for the first 2 samples
plotQualityProfile(fnRs[1:2])
Reverse reads typically show lower quality scores than forward reads. This is normal in paired-end sequencing data for several reasons:
1. Sequential Nature: Reverse reads are generated in the second half of the run, after reagents have aged and cluster signal has degraded.
2. DADA2’s Handling: DADA2 incorporates quality scores directly into its error model, so it is designed to work with this quality variation.
3. Trimming Strategy: While DADA2 can handle lower quality scores, we still trim the poorest-quality bases because doing so improves the error model and increases the number of reads that survive merging.
Due to a recent system update on the MiSeq, the sequencing run was inadvertently over-clustered, producing an excess of low-quality reads we see in the plots. Let’s break down what happened and why it matters:
During DNA sequencing, molecules attach to the flow cell surface in spots called “clusters”. Think of it like planting seeds in a garden:
- Optimal spacing: Each plant has room to grow and be identified clearly
- Over-crowded: Plants grow too close, making it hard to tell them apart
A MiSeq system update led to over-clustering, meaning:
- Too many DNA clusters formed on the flow cell
- Clusters grew too close together
- Their signals interfered with each other, like trying to read many overlapping words
This is okay for our analysis because quality is more important than quantity, removing unreliable data improves our final results, and we still have sufficient good quality reads for analysis.
High-quality reads typically have consistent Phred scores across their length, with minimal drop-off at the ends.
# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- names(filtRs) <- sample.names
DADA2’s filterAndTrim() function processes raw sequencing reads to ensure high-quality data for analysis. Here’s an explanation of each parameter:
- truncLen=c(200,130): Truncates forward reads to 200 bases and reverse reads to 130 bases
- maxEE=c(2,5): Maximum “expected errors” allowed in the forward and reverse reads
- truncQ=2: Truncates each read at the first base with a quality score of Q2 or lower
- maxN=0: Removes reads containing any ambiguous bases (N)
- trimLeft=c(0,0): No trimming from the read starts
- rm.phix=TRUE: Removes PhiX spike-in control DNA (if you are using it)
# Set trimming standards first
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(200,130),
maxN=0, maxEE=c(2,5), trimLeft = c(0,0), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE)
## Creating output directory: fastqs/filtered
# View outputs
head(out)
## reads.in reads.out
## WP-11O29_S66_L001_R1_001.fastq 114247 39363
## WP-1641C_S68_L001_R1_001.fastq 180306 63123
## WP-1J0A0_S50_L001_R1_001.fastq 118458 41544
## WP-3GH4E_S51_L001_R1_001.fastq 105493 37698
## WP-4SP0U_S57_L001_R1_001.fastq 245187 83598
## WP-6RCV3_S39_L001_R1_001.fastq 114269 40530
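The maxEE filter is based on “expected errors”: each Phred score Q implies an error probability of 10^(-Q/10), and summing these probabilities over a read gives the number of errors we expect it to contain. A small sketch with hypothetical quality scores:

```r
# Expected errors (EE) for one read: sum of per-base error probabilities
qualities <- c(38, 38, 35, 30, 25, 20, 15, 10)  # hypothetical per-base Phred scores
ee <- sum(10^(-qualities / 10))
round(ee, 3)
## [1] 0.146
```

A read passes the filter when its EE is at or below the threshold, so maxEE=c(2,5) tolerates more expected errors in the lower-quality reverse reads.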
While quality scores tell us about the confidence in each base call, DADA2 goes further by learning the specific error patterns in your dataset. Here’s how it works:
Think of it like this: Quality scores are like spell-checking individual words, while DADA2’s error model is like learning someone’s specific typing patterns to better catch their common mistakes.
\(~\)
Note: This is why DADA2 is more sophisticated than simple quality filtering - it understands the unique “personality” of your sequencing run’s errors.
\(~\)
These steps will take several minutes to run.
\(~\)
# Error model for forward reads
errF <- learnErrors(filtFs, multithread=TRUE)
## 108897000 total bases in 544485 reads from 9 samples will be used for learning the error rates.
# Error model for reverse reads
errR <- learnErrors(filtRs, multithread=TRUE)
## 104995670 total bases in 807659 reads from 13 samples will be used for learning the error rates.
# Save our outputs
saveRDS(errF,file="errF.rds")
saveRDS(errR,file="errR.rds")
\(~\)
If you are coming back to this module, you can load your saved data instead of rerunning. Use the following R code to load the saved objects.
errF <- readRDS("errF.rds")
errR <- readRDS("errR.rds")
\(~\)
We can now plot the estimated error rates. These plots display the observed error rates for each possible base transition (A→C, A→G, and so on) as a function of the consensus quality score. They let us compare the empirical error rates (points, from the actual data) to the error rates estimated by the model DADA2 has learned (black line); with nominalQ=TRUE, the red line shows the error rates expected from the nominal definition of the quality score.
\(~\)
# Plot forward reads estimated error rates
plotErrors(errF, nominalQ=TRUE) +
ggtitle('errF')
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 82 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 82 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Plot reverse reads estimated error rates
plotErrors(errR, nominalQ = TRUE) +
ggtitle('errR')
## Warning: Removed 82 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 82 rows containing missing values or values outside the scale range
## (`geom_line()`).
We will now use DADA2’s learned error rates and identify amplicon sequence variants (ASV) while removing likely errors.
\(~\)
This step will take several minutes to run.
\(~\)
# Create the DADA objects (dereplicates filtered reads into unique sequences, then infers ASVs)
dadaFs <- dada(filtFs, err=errF, multithread=TRUE)
## Sample 1 - 39363 reads in 14187 unique sequences.
## Sample 2 - 63123 reads in 26683 unique sequences.
## Sample 3 - 41544 reads in 17784 unique sequences.
## Sample 4 - 37698 reads in 12290 unique sequences.
## Sample 5 - 83598 reads in 32871 unique sequences.
## Sample 6 - 40530 reads in 14527 unique sequences.
## Sample 7 - 99030 reads in 42889 unique sequences.
## Sample 8 - 39699 reads in 13841 unique sequences.
## Sample 9 - 99900 reads in 39434 unique sequences.
## Sample 10 - 110089 reads in 40284 unique sequences.
## Sample 11 - 75378 reads in 28399 unique sequences.
## Sample 12 - 36181 reads in 12523 unique sequences.
## Sample 13 - 41526 reads in 16087 unique sequences.
## Sample 14 - 31948 reads in 11354 unique sequences.
## Sample 15 - 36964 reads in 13680 unique sequences.
## Sample 16 - 19481 reads in 9230 unique sequences.
## Sample 17 - 77711 reads in 29452 unique sequences.
## Sample 18 - 114355 reads in 42651 unique sequences.
## Sample 19 - 49661 reads in 20678 unique sequences.
## Sample 20 - 45830 reads in 17676 unique sequences.
## Sample 21 - 122888 reads in 46755 unique sequences.
## Sample 22 - 103522 reads in 36545 unique sequences.
## Sample 23 - 106278 reads in 44948 unique sequences.
## Sample 24 - 45389 reads in 18754 unique sequences.
## Sample 25 - 131341 reads in 55136 unique sequences.
## Sample 26 - 49243 reads in 20042 unique sequences.
## Sample 27 - 29334 reads in 10554 unique sequences.
## Sample 28 - 104203 reads in 42992 unique sequences.
## Sample 29 - 43671 reads in 16335 unique sequences.
## Sample 30 - 97132 reads in 41151 unique sequences.
## Sample 31 - 69682 reads in 25638 unique sequences.
## Sample 32 - 72768 reads in 28278 unique sequences.
## Sample 33 - 97360 reads in 33458 unique sequences.
## Sample 34 - 56868 reads in 21112 unique sequences.
## Sample 35 - 77521 reads in 28580 unique sequences.
dadaRs <- dada(filtRs, err=errR, multithread=TRUE)
## Sample 1 - 39363 reads in 15064 unique sequences.
## Sample 2 - 63123 reads in 27307 unique sequences.
## Sample 3 - 41544 reads in 16990 unique sequences.
## Sample 4 - 37698 reads in 12863 unique sequences.
## Sample 5 - 83598 reads in 33883 unique sequences.
## Sample 6 - 40530 reads in 13599 unique sequences.
## Sample 7 - 99030 reads in 35760 unique sequences.
## Sample 8 - 39699 reads in 14950 unique sequences.
## Sample 9 - 99900 reads in 37298 unique sequences.
## Sample 10 - 110089 reads in 37099 unique sequences.
## Sample 11 - 75378 reads in 36755 unique sequences.
## Sample 12 - 36181 reads in 10374 unique sequences.
## Sample 13 - 41526 reads in 13581 unique sequences.
## Sample 14 - 31948 reads in 10654 unique sequences.
## Sample 15 - 36964 reads in 11463 unique sequences.
## Sample 16 - 19481 reads in 8674 unique sequences.
## Sample 17 - 77711 reads in 31860 unique sequences.
## Sample 18 - 114355 reads in 32740 unique sequences.
## Sample 19 - 49661 reads in 20391 unique sequences.
## Sample 20 - 45830 reads in 14569 unique sequences.
## Sample 21 - 122888 reads in 40315 unique sequences.
## Sample 22 - 103522 reads in 37254 unique sequences.
## Sample 23 - 106278 reads in 38028 unique sequences.
## Sample 24 - 45389 reads in 17592 unique sequences.
## Sample 25 - 131341 reads in 43710 unique sequences.
## Sample 26 - 49243 reads in 16616 unique sequences.
## Sample 27 - 29334 reads in 10477 unique sequences.
## Sample 28 - 104203 reads in 37545 unique sequences.
## Sample 29 - 43671 reads in 17089 unique sequences.
## Sample 30 - 97132 reads in 39960 unique sequences.
## Sample 31 - 69682 reads in 25054 unique sequences.
## Sample 32 - 72768 reads in 26365 unique sequences.
## Sample 33 - 97360 reads in 44237 unique sequences.
## Sample 34 - 56868 reads in 18983 unique sequences.
## Sample 35 - 77521 reads in 32367 unique sequences.
# Review DADA object
dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 144 sequence variants were inferred from 14187 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
Our data comes from paired-end sequencing, in which DNA fragments are sequenced from both ends: a forward read (5’ end) and a reverse read (3’ end) of the template strand. This approach provides higher-quality data than single-end sequencing because the overlapping region between reads allows for error validation and longer final sequences. DADA2 merges these paired reads by identifying matching overlap regions and combining them into single, high-confidence sequences, discarding any pairs that don’t align properly.
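The merging idea can be sketched with a toy example: reverse-complement the reverse read, check that it overlaps the end of the forward read, and join the two. This is purely illustrative (mergePairs() uses error-aware alignment and the denoised sequences); the 18-base template here is made up:

```r
# Toy sketch of paired-read merging
revcomp <- function(x) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", x), "")[[1]]), collapse = "")
}

template <- "ACGTTGCAGGTTAACCGG"
fwd      <- substr(template, 1, 12)           # forward read: 5' end of template
rev_read <- revcomp(substr(template, 7, 18))  # reverse read: 3' end, as sequenced

# Reverse-complement the reverse read, verify the 6-base overlap, and join
rc <- revcomp(rev_read)
overlap <- 6  # fwd's last 6 bases match rc's first 6 bases
stopifnot(substr(fwd, 7, 12) == substr(rc, 1, 6))
merged <- paste0(fwd, substr(rc, overlap + 1, nchar(rc)))
identical(merged, template)
## [1] TRUE
```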
\(~\)
The mergePairs() function requires the dada objects we created as well as the filtered reads, because not all of that information is stored in the dada objects we created in the last step.
\(~\)
# Combine the forward and reverse reads into one
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
## sequence
## 1 TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAGG
## 2 TACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTCACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGG
## 3 TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGATGCTCAACATCTGAACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAGG
## 4 TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGATGCTCAACATCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAGG
## 5 TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTGTGGCAAGTCTGATGTGAAAGGCATGGGCTCAACCTGTGGACTGCATTGGAAACTGTCATACTTGAGTGCCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
## 6 AACGTAGGTCACAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAATCTATGGGCTCAACCCATAAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGGTGTAGCGGTGGAATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCCTACTGGGCACCAACTGACGCTGAGGCTCGAAAGTGTGGGTAGCAAACAGG
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 9062 1 1 77 0 0 2 TRUE
## 2 2820 2 3 77 0 0 1 TRUE
## 3 2214 3 1 77 0 0 2 TRUE
## 4 2132 4 1 77 0 0 2 TRUE
## 5 1975 5 4 77 0 0 1 TRUE
## 6 1379 6 2 77 0 0 2 TRUE
We can now construct the ASV table. This table is a matrix with samples as rows and each identified ASV as a column.
\(~\)
# View the dimensions of our merged samples
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
## [1] 35 5902
Our table contains 35 samples and 5902 ASVs. Now let’s view the lengths of the variants.
\(~\)
# Inspect the distribution of sequence lengths
table(nchar(getSequences(seqtab))) # the length should be around 250
##
## 200 252 253 254 255
## 1 389 5466 44 2
DADA2’s error model corrects for substitutions and indels, but not chimeras (sequences formed when two different DNA strands merge during PCR). Chimeras are easier to identify among ASVs than among OTUs because ASVs are exact sequences.
\(~\)
# Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
## Identified 4456 bimeras out of 5902 input sequences.
Note: DADA2 uses the term “bimera” specifically because it refers to sequences formed from exactly two parent sequences, “bi-” meaning two. While “chimera” is the broader term that can include sequences formed from multiple parent sequences (three or more), most artificial chimeric sequences in amplicon sequencing are actually bimeras. DADA2’s algorithm specifically looks for these two-parent artifacts as they are the most common type of chimeric sequence formed during PCR amplification.
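What removeBimeraDenovo() looks for can be illustrated with toy sequences: a bimera is a sequence that can be reconstructed exactly as a left segment of one more-abundant “parent” ASV joined to a right segment of another (the parents and breakpoint here are made up):

```r
# A bimera: left part from one parent sequence, right part from another
parentA <- "AAAAAAAAAATTTTTTTTTT"
parentB <- "CCCCCCCCCCGGGGGGGGGG"
breakpoint <- 10
bimera <- paste0(substr(parentA, 1, breakpoint),
                 substr(parentB, breakpoint + 1, nchar(parentB)))
bimera
## [1] "AAAAAAAAAAGGGGGGGGGG"
```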
\(~\)
# New sequence table
dim(seqtab.nochim)
## [1] 35 1446
# Divide the new number of ASVs by the first table.
# This shows us what percentage was removed
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9041567
Here we can see that chimeras accounted for ~10% of the merged sequence reads.
\(~\)
# Save our data
saveRDS(seqtab.nochim, "seqtabnochim.rds")
Let’s construct a summary table of our steps to see how many reads were removed in the cleaning process.
\(~\)
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim),
final_perc_reads_retained=round(rowSums(seqtab.nochim)/out[,1]*100, 1))
# formatting table before writing to file
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim","percent")
rownames(track) <- sample.names
track
## input filtered denoisedF denoisedR merged nonchim percent
## WP-11O29 114247 39363 39130 38899 35785 35325 30.9
## WP-1641C 180306 63123 62576 61262 53717 50893 28.2
## WP-1J0A0 118458 41544 41085 41021 36690 35654 30.1
## WP-3GH4E 105493 37698 37503 37303 35362 34799 33.0
## WP-4SP0U 245187 83598 82881 82013 71673 64365 26.3
## WP-6RCV3 114269 40530 40341 40292 37855 35649 31.2
## WP-7FAEC 288105 99030 98289 98026 85388 75749 26.3
## WP-8SXNH 115689 39699 39364 39339 36520 35137 30.4
## WP-8V6E3 286174 99900 99028 98606 87580 78614 27.5
## WP-95020 314274 110089 109446 109042 96754 81235 25.8
## WP-9ZJJI 214646 75378 74726 73625 64124 57152 26.6
## WP-ATJLX 101691 36181 35970 35790 33886 32929 32.4
## WP-DFG11 118093 41526 41037 41058 37238 36165 30.6
## WP-EC5AP 90638 31948 31752 31586 29704 29463 32.5
## WP-G375R 105928 36964 36671 36630 33822 32454 30.6
## WP-G9EBL 57168 19481 19278 19033 17383 17321 30.3
## WP-J06MK 223327 77711 77295 76740 67794 63284 28.3
## WP-J0B3X 335549 114355 113660 113368 103320 91594 27.3
## WP-KM8BK 146122 49661 49256 48787 43015 39750 27.2
## WP-KXGT6 132046 45830 45486 45502 41990 40801 30.9
## WP-MRCZG 352149 122888 121987 121730 106122 90633 25.7
## WP-QIGEC 302735 103522 102921 102225 90427 77523 25.6
## WP-RAJES 307883 106278 105429 104844 93042 80190 26.0
## WP-RPV50 131966 45389 45066 44899 41204 40710 30.8
## WP-SUFNU 383563 131341 130268 129981 110386 88913 23.2
## WP-TDY7G 146345 49243 48841 48686 44761 43235 29.5
## WP-U4XI6 85807 29334 29158 28906 27224 26929 31.4
## WP-UIXW3 304129 104203 103303 103001 88260 75767 24.9
## WP-X576L 122710 43671 43368 43158 39647 37792 30.8
## WP-X803O 281887 97132 96371 95702 82639 74452 26.4
## WP-XDI0I 199408 69682 69310 69088 62969 56107 28.1
## WP-YF6XT 210492 72768 72192 72016 65246 59099 28.1
## WP-YRYYF 273738 97360 96780 95716 83187 71752 26.2
## WP-YY8WG 165937 56868 56492 56250 50839 47555 28.7
## WP-Z01LM 229511 77521 77018 76632 68652 63541 27.7
write.table(track, "read-count-tracking.txt", quote=FALSE, sep="\t")
We retained ~25-30% of our original reads! This is likely due to the poor quality we saw at the beginning. Although we now have much less data to work with, it is high quality.
The last step in this module is to assign taxonomy using a reference database. We must first select a database. Two of the most commonly used are Greengenes and SILVA, each with its strengths and weaknesses.
\(~\)
Greengenes [2] is a chimera-checked 16S rRNA gene database. It provides chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies. Its most recent update was on July 03, 2017.
SILVA [3, 4] is a high-quality ribosomal RNA database. It is comprehensive and quality-controlled, with up-to-date ribosomal RNA sequences. Additionally, SILVA provides many other tools, such as alignment, phylogenetic tree classification, and probe/primer matching. It was last updated on August 27, 2020.
\(~\)
Since our data has chimeras removed and we want the most up-to-date analysis, we will use SILVA.
\(~\)
This step will take several minutes to run.
\(~\)
# Read in Taxonomy
taxa <- assignTaxonomy(seqtab.nochim, "tax/silva_nr99_v138.1_train_set.fa.gz", multithread=TRUE)
taxa <- addSpecies(taxa, "tax/silva_species_assignment_v138.1.fa.gz")
# Let's take a look
taxa.print <- taxa # Removing sequence row names for display only
rownames(taxa.print) <- NULL
head(taxa.print)
## Kingdom Phylum Class Order
## [1,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
## [2,] "Bacteria" "Firmicutes" "Clostridia" "Oscillospirales"
## [3,] "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## [4,] "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## [5,] "Bacteria" "Firmicutes" "Clostridia" "Oscillospirales"
## [6,] "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## Family Genus Species
## [1,] "Bacteroidaceae" "Bacteroides" "vulgatus"
## [2,] "Ruminococcaceae" "Faecalibacterium" NA
## [3,] "Lachnospiraceae" "Blautia" NA
## [4,] "Lachnospiraceae" "Agathobacter" NA
## [5,] "Ruminococcaceae" "Faecalibacterium" "prausnitzii"
## [6,] "Lachnospiraceae" "Fusicatenibacter" "saccharivorans"
Microbiome studies often include mock and/or blank controls to help identify and mitigate possible contamination, ensuring the accuracy and reliability of the data.
\(~\)
Although our study does not have these controls, incorporating both mock and blank controls in the study design can boost quality control. It helps distinguish true biological signals from artifacts introduced during sample processing or sequencing.
\(~\)
# Save our data
saveRDS(taxa, "taxa.rds")
# Run the following command to take the quiz
library(htmltools)
includeHTML("questions/Quiz_Submodule2.html")
In this module, we took a comprehensive journey from the initial stages of data acquisition through taxonomic assignment. We began by downloading and exploring the survey data to understand the metadata associated with each sample. Following this, we downloaded the raw sequencing FASTQ files and preprocessed them using the DADA2 pipeline, which allowed us to perform quality control, error correction, and sequence denoising, resulting in high-quality amplicon sequence variants (ASVs). Finally, we assigned taxonomy to the ASVs using a reference database, enabling us to identify and categorize the microbial communities present in our samples. This workflow provides a foundational dataset ready for downstream analysis, such as community composition comparisons and diversity assessments in the next module.